Edward Vytlacil
Experimental results were very positive.
Led to rapid expansion in Mexico:
Led to similar programs in Bangladesh, Brazil, Cambodia, Chile, Colombi, Egypt, Guatemala, Honduras, Indonesia, Jamaica, Nicaragua, Panama, Peru, Phillipines, Turkey and the United States
to mother of the household,
to mother of the household,
conditional on school enrollment between 3rd and 11th grades for children less than 18 years old,
higher for
Of 506 poor, rural villages:
We will now analyze effect of PROGRESA on school enrollment.
Download the data set PROGRESA.csv,
Start by loading the following libraries. Install them first if you have not already done so.
read.csv to read the downloaded data into a data.frame.names, dim, head, table and summary.names, dim, head, table and summary.names, dim, head, table and summary.names, dim, head, table and summary. wave sooloca indivill_seq villsize_t
Min. :1.00 Min. : 1 Min. : 1.00 Min. : 1.00
1st Qu.:2.00 1st Qu.:131 1st Qu.: 8.00 1st Qu.: 26.00
Median :3.00 Median :253 Median : 18.00 Median : 39.00
Mean :3.06 Mean :254 Mean : 25.63 Mean : 50.26
3rd Qu.:4.00 3rd Qu.:395 3rd Qu.: 34.00 3rd Qu.: 63.00
Max. :5.00 Max. :491 Max. :210.00 Max. :210.00
progresa1 sooind_id sex1 age1
Min. :0.0000 Min. : 1 Min. :0.000 Min. : 0.00
1st Qu.:0.0000 1st Qu.: 3983 1st Qu.:0.000 1st Qu.: 9.00
Median :1.0000 Median : 7894 Median :0.000 Median :11.00
Mean :0.6255 Mean : 7877 Mean :0.485 Mean :11.33
3rd Qu.:1.0000 3rd Qu.:11768 3rd Qu.:1.000 3rd Qu.:14.00
Max. :1.0000 Max. :15669 Max. :1.000 Max. :99.00
hgc1 poor1 school
Min. :0.000 Min. :1 Min. :0.0000
1st Qu.:2.000 1st Qu.:1 1st Qu.:1.0000
Median :4.000 Median :1 Median :1.0000
Mean :4.029 Mean :1 Mean :0.8275
3rd Qu.:6.000 3rd Qu.:1 3rd Qu.:1.0000
Max. :9.000 Max. :1 Max. :1.0000
Lets investigate the child with the reported age of 0.
The extreme ages appear to be data entry errors.
We could try to impute them, for example, impute that the previous child is age \(7\) in waves \(3\) and \(4\) based on his ages in waves \(1\) and \(5\).
Instead, we will set to missing.
We do not have 74031 independent observations.
How many child-observations are in the data?
Data has 74031 rows (child-wave observations).
How many child-observations are in the data?
Child id is sooind_id.
Use length and unique functions to find that there are 15669 child-observations in the data, of which 9799 are treated and 5870 are control.
Data has 74031 rows (child-wave observations).
There are 15669 child-observations in the data, of which 9799 are treated and 5870 are control.
How many villages?
Data has 74031 rows (child-wave observations).
There are 15669 child-observations in the data, of which 9799 are treated and 5870 are control.
How many villages?
Village id is sooloca.
Use length and unique functions to find that there are 491 villages, of which 308 are treated and 183 are control.
subset function, restrict data to second wave (last baseline wave) and fifth wave (final post-treatment wave) and keep only relevant variables.Using the ifelse command, code as missing any age less than 6 or greater than 18.
Using the ifelse command, create an indicator variable for whether an observation is from the post period, i.e., a variable that equals 1 if the observation is from waves 5.
Create seperate data frames for pre and post treatment observations.
HMISC.tidyverse.Produce table examining balance in covariates between treated and control at baseline.
By randomization, no systematic differences in expectation.
However, some differences could arise by random chance.
| Control | Treated | Diff. | Std Diff | |
| Girl | 0.50 | 0.48 | -0.02 | -0.04 |
| Age | 10.51 | 10.49 | -0.02 | -0.01 |
| Highest Grade | 3.26 | 3.28 | 0.02 | 0.01 |
| Enrolled | 0.87 | 0.87 | 0.002 | 0.01 |
| Control | Treated | Diff. | Std Diff | |
| Girl | 0.50 | 0.48 | -0.02 | -0.04 |
| Age | 10.51 | 10.49 | -0.02 | -0.01 |
| Highest Grade | 3.26 | 3.28 | 0.02 | 0.01 |
| Enrolled | 0.87 | 0.87 | 0.002 | 0.01 |
| Control | Treated | Diff. | Std Diff | |
| Girl | 0.50 | 0.48 | -0.02 | -0.04 |
| Age | 10.51 | 10.49 | -0.02 | -0.01 |
| Highest Grade | 3.26 | 3.28 | 0.02 | 0.01 |
| Enrolled | 0.87 | 0.87 | 0.002 | 0.01 |
return_mean_by_treatment <- function(x){
means.t<-tapply(x,dfPre$treat,mean)
var.t<-tapply(x,dfPre$treat,var)
return(c(means.t,var.t))
}
vars <- c("girl","age1","hgc1","school")
output <- sapply(dfPre[vars],return_mean_by_treatment)
means <- output[1:2,]
diffs <- output[2,]-output[1,]
N1 <- sum(dfPre$treat)
N0 <- sum(1-dfPre$treat)
pooled.sd <- sqrt(((N0-1)*output[3,]+(N1-1)*output[4,])/(N0+N1-2))
std.diffs <- (output[2,]-output[1,])/pooled.sd
results0 <- t(rbind(means,diffs,std.diffs))
colnames(results0)<-c("Control","Treated","Diff.","Std Diff")
varlabels <- c("Girl","Age","Highest Grade","Enrolled")
rownames(results0)<-c(varlabels)
stargazer(results0, type="html", digits=2, title="Balance at Baseline")MeanSchC.b <- with(subset(dfPre, treat==0 & girl==0),tapply(school, hgc1, mean))
MeanSchT.b <- with(subset(dfPre, treat==1 & girl==0),tapply(school, hgc1, mean))
Grade <- as.factor(rep(c(1:10),2))
Group <- c(rep(" Control", 10), rep("Treated", 10))
MeanSch.b <- matrix(c(MeanSchC.b,MeanSchT.b), nrow = 20, ncol = 1)
tab.b <- data.frame(Grade, Group, MeanSch.b)
plotPre.b <- ggplot(tab.b, aes(x = Grade, y = MeanSch.b, fill = Group)) +
geom_col(width = 0.7, position = position_dodge(width=0.8)) +
theme_bw(base_size = 11) +
theme(legend.position = "bottom", legend.title = element_blank()) +
scale_y_continuous(breaks = seq(from = 0, to = 1, by = 0.1)) +
xlab("Grade Level") +
ylab("Mean School Enrollment")+
ggtitle("Boys: School Enrollment by Grade, Pre Period")
ggplotly(plotPre.b,tooltip="y") Using mean differences on post-treatment data to estimate effect of PROGRESA on school enrollment:
Overall: 0.041,
By sex:
For boys: 0.022,
For girls: 0.06.
mean(dfPost[ dfPost$treat==1,"school"]) - mean(dfPost[ dfPost$treat==0,"school"])
# 0.04063189
mean(dfPost[dfPost$treat==1&dfPost$girl==0,"school"])-mean(dfPost[dfPost$treat==0&dfPost$girl==0,"school"])
# 0.02215276
mean(dfPost[dfPost$treat==1&dfPost$girl==1,"school"])- mean(dfPost[dfPost$treat==0&dfPost$girl==1,"school"])
# 0.05975947Using mean differences on post-treatment data to estimate effect of PROGRESA on school enrollment:
Overall: 0.041,
By sex:
For boys: 0.022,
For girls: 0.06.
Mean difference estimator justified by randomization, no selection bias.
Can take mean difference conditional on any covariate not effected by the treatment, including any baseline characteristic.
MeansC <- with(dfPost,tapply(school,list(treat,girl,hgc1),mean))
dimnames(MeansC)[[1]] <-c("Control","Treated")
dimnames(MeansC)[[2]] <-c("Boys","Girls")
EffSchBoy <- (MeansC[2,1,] - MeansC[1,1,])
EffSchGirl <- (MeansC[2,2,] - MeansC[1,2,])
Grade <- as.factor(c(1:10, 1:10))
EffSch <- matrix(c(EffSchBoy, EffSchGirl), nrow = 20, ncol = 1)
sex <- c(rep("Boy", 10), rep("Girl", 10))
tabEff <- data.frame(Grade, sex, EffSch)
plotEff <- ggplot(tabEff, aes(x = Grade, y = EffSch, fill = sex)) +
geom_col(width = 0.7, position = position_dodge(width=0.8)) +
theme_bw(base_size = 10) +
theme(legend.position = "bottom", legend.title = element_blank()) +
scale_y_continuous(breaks = seq(from = -0.1, to = 0.1, by = 0.05)) +
xlab("Grade Level") +
ylab("Mean School Enrollment") +
ggtitle(" Treatment Effects Estimates by Grade and Sex")
ggplotly(plotEff,tooltip="y")Some results surprising, e.g. estimated effect on boys in first grade is -0.0676692.
Random sampling noise?
With i.i.d. data, \(\mbox{Var}(\bar{X}_N)=\frac{\sigma^2}{N}\), where \(\sigma^2\) is the variance of each \(X_i\).
The higher the number of observations, the lower the sampling noise.
Should we think of sample size as number of children in a sex, treatment status, grade cell? number of villages in a cell?
person.counts <- with(dfPost,tapply(sooind_id,list(treat,girl,hgc1),function(x){length(unique(x))}))
dimnames(person.counts)[[1]] <-c("Control","Treated")
dimnames(person.counts)[[2]] <-c("Boys","Girls")
Grade <- as.factor(rep(c(1:10),2))
Group <- c(rep(" Control", 10), rep("Treated", 10))
numbers.boys <- matrix(c(person.counts[1,1,],person.counts[2,1,]), nrow = 20, ncol = 1)
numbers.girls<- matrix(c(person.counts[1,2,],person.counts[2,2,]), nrow = 20, ncol = 1)
tab.boys <- data.frame(Grade, Group, numbers.boys)
tab.girls <- data.frame(Grade, Group, numbers.girls)
plot.n.boys <- ggplot(tab.boys, aes(x = Grade, y = numbers.boys, fill = Group)) +
geom_col(width = 0.7, position = position_dodge(width=0.8)) +
theme_bw(base_size = 11) +
theme(legend.position = "bottom", legend.title = element_blank()) +
scale_y_continuous(limits=c(0,1100),breaks = seq(from = 0, to = 1100, by = 100)) +
xlab("Grade Level") +
ylab("Number of Boy Observations")+
ggtitle("Number of Boy Observations, Treated and Control")
plot.n.girls <- ggplot(tab.girls, aes(x = Grade, y = numbers.girls, fill = Group)) +
geom_col(width = 0.7, position = position_dodge(width=0.8)) +
theme_bw(base_size = 11) +
theme(legend.position = "bottom", legend.title = element_blank()) +
scale_y_continuous(limits=c(0,1100),breaks = seq(from = 0, to = 1100, by = 100)) +
xlab("Grade Level") +
ylab("Number of Girl Observations")+
ggtitle("Number of Girl Observations, Treated and Control")
village.counts <- with(dfPost,tapply(sooloca,list(treat,girl,hgc1),function(x){length(unique(x))}))
dimnames(village.counts)[[1]] <-c("Control","Treated")
dimnames(village.counts)[[2]] <-c("Boys","Girls")
numbers.boys.v <- matrix(c(village.counts[1,1,],village.counts[2,1,]), nrow = 20, ncol = 1)
numbers.girls.v<- matrix(c(village.counts[1,2,],village.counts[2,2,]), nrow = 20, ncol = 1)
tab.boys.v <- data.frame(Grade, Group, numbers.boys.v)
tab.girls.v <- data.frame(Grade, Group, numbers.girls.v)
plot.n.boys.v <- ggplot(tab.boys.v, aes(x = Grade, y = numbers.boys.v, fill = Group)) +
geom_col(width = 0.7, position = position_dodge(width=0.8)) +
theme_bw(base_size = 11) +
theme(legend.position = "bottom", legend.title = element_blank()) +
scale_y_continuous(limits=c(0,1100),breaks = seq(from = 0, to = 1100, by = 100)) +
xlab("Grade Level") +
ylab("Number of Villages with Boy Observations")+
ggtitle("Number of Villages with Boy Observations")
plot.n.girls.v <- ggplot(tab.girls.v, aes(x = Grade, y = numbers.girls.v, fill = Group)) +
geom_col(width = 0.7, position = position_dodge(width=0.8)) +
theme_bw(base_size = 11) +
theme(legend.position = "bottom", legend.title = element_blank()) +
scale_y_continuous(limits=c(0,1100),breaks = seq(from = 0, to = 1100, by = 100)) +
xlab("Grade Level") +
ylab("Number of Villages with Girl Observations")+
ggtitle("Number of Villages with Girl Observations")Estimated effects:
positive overall;
larger for girls than for boys;
vary widely across sex-grade cells.
To what degree are these results real, versus driven by sampling noise?
Next time: hypothesis testing!
Econ 123: Intermediate Econometrics and Data Analysis